Abstract: In this paper, we focus on formal synthesis of
control policies for finite Markov decision processes with 
non-negative real-valued costs. We develop an algorithm to 
automatically generate a policy that guarantees the satisfaction 
of a correctness specification expressed as a formula of Linear 
Temporal Logic, while at the same time minimizing the expected 
average cost between two consecutive satisfactions of a desired 
property. The existing solutions to this problem are sub-optimal. 
By leveraging ideas from automata-based model checking and 
game theory, we provide an optimal solution. We demonstrate 
the approach on an illustrative example. 

I. INTRODUCTION 

Markov Decision Processes (MDPs) are probabilistic models
widely used in various areas, such as economics, biology,
and engineering. In robotics, they have been successfully 
used to model the motion of systems with actuation and 
sensing uncertainty, such as ground robots [17], unmanned 
aircraft [21], and surgical steering needles [1]. MDPs are central
to control theory [4], probabilistic model checking and
synthesis in formal methods [3], [9], and game theory [13]. 

MDP control is a well studied area (see e.g., [4]). The 
goal is usually to optimize the expected value of a cost 
over a finite time (e.g., stochastic shortest path problem) 
or an average expected cost in infinite time (e.g., average 
cost per stage problem). Recently, there has been increasing 
interest in developing MDP control strategies from rich 
specifications given as formulas of probabilistic temporal 
logics, such as Probabilistic Computation Tree Logic (PCTL) 
and Probabilistic Linear Temporal Logic (PLTL) [12], [17]. 
It is important to note that both optimal control and temporal 
logic control problems for MDPs have their counterpart in 
automata game theory. Specifically, optimal control translates
to solving 1½-player games with payoff functions, such
as discounted-payoff and mean-payoff games [6]. Temporal
logic control for MDPs corresponds to solving 1½-player
games with parity objectives [2].

Our aim is to optimize the behavior of a system subject to 
correctness (temporal logic) constraints. Such a connection 
between optimal and temporal logic control is an intriguing 
problem with potentially high impact in several applications. 
Consider, for example, a mobile robot involved in a persistent 
surveillance mission in a dangerous area under tight fuel or 
time constraints. The correctness requirement is expressed as 
a temporal logic specification, e.g., "Keep visiting A and then
B and always avoid C". The resource constraints translate to 
minimizing a cost function over the feasible trajectories of 
the robot. Motivated by such applications, in this paper we 
focus on correctness specifications given as LTL formulae 
and optimization objectives expressed as average expected 
cumulative costs per surveillance cycle (ACPC). 

The main contribution of this work is to provide a correct 
and complete solution to the above problem. This paper can 
be seen as an extension of [18], [19], [11], [8]. In [18], we 
focused on deterministic transition systems and developed a 
finite-horizon online planner to provably satisfy an LTL constraint
while optimizing the behavior of the system between
every two consecutive satisfactions of a given proposition. 
We extended this framework in [19], where we provided 
an algorithm to optimize the long-term average behavior 
of deterministic transition systems with time-varying events 
of known statistics. The closest to this work is [11], where 
the authors focus on a problem of optimal LTL control of 
MDPs with real-valued costs on actions. The correctness 
specification is assumed to include a persistent surveillance 
task and the goal is to minimize the long-term expected 
average cost between successive visits of the locations under 
surveillance. Using dynamic programming techniques, the 
authors design a solution that is sub-optimal in the general 
case. In [8], it is shown that, for a certain fragment of LTL, 
the solution becomes optimal. By using recent results from 
game theory [5], in this paper we provide an optimal solution 
for full LTL. 

The rest of the paper is organized as follows. In Sec. II,
we introduce the notation and provide necessary definitions.
The problem is formulated in Sec. III. The main algorithms,
together with discussions on their complexity, are presented
in Sec. IV. Finally, Sec. V contains experimental results.



II. Preliminaries 

For a set $S$, we use $S^\omega$ and $S^+$ to denote the set of all
infinite and all non-empty finite sequences of elements of $S$,
respectively. For a finite sequence $\tau = a_0 \ldots a_n \in S^+$, we
use $|\tau| = n + 1$ to denote the length of $\tau$. For $0 \le i \le n$,
$\tau(i) = a_i$ and $\tau^{(i)} = a_0 \ldots a_i$ is the finite prefix of $\tau$ of length
$i + 1$. We use the same notation for an infinite sequence from
the set $S^\omega$.

A. MDP Control 

Definition 1: A Markov decision process (MDP) is a
tuple $\mathcal{M} = (S, A, P, AP, L, g)$, where $S$ is a non-empty
finite set of states, $A$ is a non-empty finite set of actions,
$P : S \times A \times S \to [0, 1]$ is a transition probability function
such that for every state $s \in S$ and action $a \in A$ it
holds that $\sum_{s' \in S} P(s, a, s') \in \{0, 1\}$, $AP$ is a finite set of
atomic propositions, $L : S \to 2^{AP}$ is a labeling function, and
$g : S \times A \to \mathbb{R}_0^+$ is a cost function. An initialized Markov
decision process is an MDP $\mathcal{M} = (S, A, P, AP, L, g)$ with
a distinctive initial state $s_{init} \in S$.

An action $a \in A$ is called enabled in a state $s \in S$ if
$\sum_{s' \in S} P(s, a, s') = 1$. With a slight abuse of notation, $A(s)$
denotes the set of all actions enabled in a state $s$. We assume
$A(s) \neq \emptyset$ for every $s \in S$.
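As an illustration of Definition 1, the following minimal sketch stores a finite MDP in plain dictionaries and recovers the enabled actions $A(s)$. The class name `MDP`, its field names, and the toy two-state instance are assumptions made for this example only; they do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class MDP:
    """Finite MDP (S, A, P, AP, L, g); all field names are illustrative."""
    states: set
    actions: set
    P: dict   # (s, a) -> {s': probability}; a missing key means "a is not enabled in s"
    L: dict   # s -> set of atomic propositions
    g: dict   # (s, a) -> non-negative cost

    def enabled(self, s):
        """A(s): actions whose outgoing probabilities from s sum to 1."""
        return {a for a in self.actions
                if abs(sum(self.P.get((s, a), {}).values()) - 1.0) < 1e-9}

    def validate(self):
        """Check the conditions of Definition 1 and the assumption A(s) != {}."""
        for s in self.states:
            for a in self.actions:
                total = sum(self.P.get((s, a), {}).values())
                assert abs(total) < 1e-9 or abs(total - 1.0) < 1e-9
                assert self.g.get((s, a), 0.0) >= 0.0
            assert self.enabled(s), f"no enabled action in {s}"


# Hypothetical toy instance: s1 is the only state labeled "sur".
toy = MDP(
    states={"s0", "s1"},
    actions={"a", "b"},
    P={("s0", "a"): {"s0": 0.5, "s1": 0.5}, ("s1", "b"): {"s0": 1.0}},
    L={"s0": set(), "s1": {"sur"}},
    g={("s0", "a"): 1.0, ("s1", "b"): 2.0},
)
toy.validate()
```

Storing only the enabled pairs $(s, a)$ keeps the sum-to-0-or-1 condition implicit: a missing key stands for a row of zeros.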

A run of an MDP $\mathcal{M}$ is an infinite sequence of states
$\rho = s_0 s_1 \ldots \in S^\omega$ such that for every $i \ge 0$, there exists
$a_i \in A(s_i)$ with $P(s_i, a_i, s_{i+1}) > 0$. We use $\mathrm{Run}^{\mathcal{M}}(s)$ to denote
the set of all runs of $\mathcal{M}$ that start in a state $s \in S$. Let
$\mathrm{Run}^{\mathcal{M}} = \bigcup_{s \in S} \mathrm{Run}^{\mathcal{M}}(s)$. A finite run $\sigma = s_0 \ldots s_n \in S^+$
of $\mathcal{M}$ is a finite prefix of a run in $\mathcal{M}$ and $\mathrm{Run}_{fin}^{\mathcal{M}}(s)$ denotes
the set of all finite runs of $\mathcal{M}$ starting in a state $s \in S$. Let
$\mathrm{Run}_{fin}^{\mathcal{M}} = \bigcup_{s \in S} \mathrm{Run}_{fin}^{\mathcal{M}}(s)$. The length $|\sigma| = n + 1$ of a finite
run $\sigma = s_0 \ldots s_n$ is also referred to as the number of stages
of the run. The last state of $\sigma$ is denoted by $last(\sigma) = s_n$.

The word induced by a run $\rho = s_0 s_1 \ldots$ of $\mathcal{M}$ is an infinite
sequence $L(s_0) L(s_1) \ldots \in (2^{AP})^\omega$. Similarly, a finite run of
$\mathcal{M}$ induces a finite word from the set $(2^{AP})^+$.

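Definition 2: Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an MDP.
An end component (EC) of the MDP $\mathcal{M}$ is an MDP $\mathcal{N} =
(S_{\mathcal{N}}, A_{\mathcal{N}}, P|_{\mathcal{N}}, AP, L|_{\mathcal{N}}, g|_{\mathcal{N}})$ such that $\emptyset \neq S_{\mathcal{N}} \subseteq S$,
$\emptyset \neq A_{\mathcal{N}} \subseteq A$. For every $s \in S_{\mathcal{N}}$ and $a \in A_{\mathcal{N}}(s)$ it holds
that $\{s' \in S \mid P(s, a, s') > 0\} \subseteq S_{\mathcal{N}}$. For every pair of
states $s, s' \in S_{\mathcal{N}}$, there exists a finite run $\sigma \in \mathrm{Run}_{fin}^{\mathcal{N}}(s)$
such that $last(\sigma) = s'$. We use $P|_{\mathcal{N}}$ to denote the function
$P$ restricted to the sets $S_{\mathcal{N}}$ and $A_{\mathcal{N}}$. Similarly, we use
$L|_{\mathcal{N}}$ and $g|_{\mathcal{N}}$ with the obvious meaning. If the context
is clear, we only use $P, L, g$ instead of $P|_{\mathcal{N}}, L|_{\mathcal{N}}, g|_{\mathcal{N}}$.
An EC $\mathcal{N}$ of $\mathcal{M}$ is called maximal (MEC) if there is no EC
$\mathcal{N}' = (S_{\mathcal{N}'}, A_{\mathcal{N}'}, P, AP, L, g)$ of $\mathcal{M}$ such that $\mathcal{N}' \neq \mathcal{N}$,
$S_{\mathcal{N}} \subseteq S_{\mathcal{N}'}$ and $A_{\mathcal{N}}(s) \subseteq A_{\mathcal{N}'}(s)$ for every $s \in S_{\mathcal{N}}$. The sets
of all end components and maximal end components of $\mathcal{M}$
are denoted by $EC(\mathcal{M})$ and $MEC(\mathcal{M})$, respectively.

The number of ECs of an MDP $\mathcal{M}$ can be up to exponential
in the number of states of $\mathcal{M}$ and they can intersect. On
the other hand, MECs are pairwise disjoint and every EC is
contained in a single MEC. Hence, the number of MECs of
$\mathcal{M}$ is bounded by the number of states of $\mathcal{M}$.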

Definition 3: Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an MDP.
A control strategy for $\mathcal{M}$ is a function $C : \mathrm{Run}_{fin}^{\mathcal{M}} \to A$ such
that for every $\sigma \in \mathrm{Run}_{fin}^{\mathcal{M}}$ it holds that $C(\sigma) \in A(last(\sigma))$.

A strategy $C$ for which $C(\sigma) = C(\sigma')$ for all finite
runs $\sigma, \sigma' \in \mathrm{Run}_{fin}^{\mathcal{M}}$ with $last(\sigma) = last(\sigma')$ is called
memoryless. In that case, we consider $C$ to be a function
$C : S \to A$. A strategy is called finite-memory if it is defined
as a tuple $C = (M, act, \Delta, start)$, where $M$ is a finite
set of modes, $\Delta : M \times S \to M$ is a transition function,
$act : M \times S \to A$ selects an action to be applied in $\mathcal{M}$, and
$start : S \to M$ selects the starting mode for every $s \in S$.

A run $\rho = s_0 s_1 \ldots \in \mathrm{Run}^{\mathcal{M}}$ of an MDP $\mathcal{M}$ is called a run
under a strategy $C$ for $\mathcal{M}$ if for every $i \ge 0$, it holds that
$P(s_i, C(\rho^{(i)}), s_{i+1}) > 0$. A finite run under $C$ is a finite
prefix of a run under $C$. The sets of all infinite and finite
runs of $\mathcal{M}$ under $C$ starting in a state $s \in S$ are denoted by
$\mathrm{Run}^{\mathcal{M},C}(s)$ and $\mathrm{Run}_{fin}^{\mathcal{M},C}(s)$, respectively. Let $\mathrm{Run}^{\mathcal{M},C} =
\bigcup_{s \in S} \mathrm{Run}^{\mathcal{M},C}(s)$ and $\mathrm{Run}_{fin}^{\mathcal{M},C} = \bigcup_{s \in S} \mathrm{Run}_{fin}^{\mathcal{M},C}(s)$.

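Let $\mathcal{M}$ be an MDP, $s$ a state of $\mathcal{M}$, and $C$ a strategy
for $\mathcal{M}$. The following probability measure is used to argue
about the possible outcomes of applying $C$ in $\mathcal{M}$ starting
from $s$. Let $\sigma \in \mathrm{Run}_{fin}^{\mathcal{M},C}(s)$ be a finite run. A cylinder set
$Cyl(\sigma)$ of $\sigma$ is the set of all runs of $\mathcal{M}$ under $C$ that have $\sigma$
as a finite prefix. There exists a unique probability measure
$\mathrm{Pr}_s^{\mathcal{M},C}$ on the $\sigma$-algebra generated by the set of cylinder sets
of all runs in $\mathrm{Run}_{fin}^{\mathcal{M},C}(s)$. For $\sigma = s_0 \ldots s_n \in \mathrm{Run}_{fin}^{\mathcal{M},C}(s)$,
it holds

$$\mathrm{Pr}_s^{\mathcal{M},C}(Cyl(\sigma)) = \prod_{i=0}^{n-1} P(s_i, C(\sigma^{(i)}), s_{i+1})$$

and $\mathrm{Pr}_s^{\mathcal{M},C}(Cyl(s)) = 1$. Intuitively, given a subset $X \subseteq
\mathrm{Run}^{\mathcal{M},C}(s)$, $\mathrm{Pr}_s^{\mathcal{M},C}(X)$ is the probability that a run of $\mathcal{M}$
under $C$ that starts in $s$ belongs to the set $X$.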
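The probability of a cylinder set is a product of one-step transition probabilities along the finite run. The sketch below computes it for a strategy given as a function from finite runs to actions; the function name `cylinder_probability`, the dictionary encoding of $P$, and the toy data are illustrative assumptions.

```python
def cylinder_probability(P, strategy, run):
    """Probability of Cyl(run) under `strategy`.

    P:        dict (s, a) -> {s': probability}
    strategy: function mapping a finite run (tuple of states) to an action
    run:      sequence of states s_0 ... s_n
    """
    prob = 1.0
    for i in range(len(run) - 1):
        prefix = tuple(run[: i + 1])        # the prefix s_0 ... s_i
        action = strategy(prefix)
        prob *= P.get((run[i], action), {}).get(run[i + 1], 0.0)
    return prob


# Hypothetical two-state MDP and a memoryless strategy.
P = {("s0", "a"): {"s0": 0.5, "s1": 0.5}, ("s1", "b"): {"s0": 1.0}}
choice = {"s0": "a", "s1": "b"}


def memoryless(prefix):
    """Pick the action from the last state of the prefix only."""
    return choice[prefix[-1]]


p = cylinder_probability(P, memoryless, ["s0", "s0", "s1", "s0"])  # 0.5 * 0.5 * 1.0
```

A run consisting of a single state has an empty product, so its cylinder set has probability 1, matching the second identity above.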

The following properties hold for any MDP $\mathcal{M}$ (see, e.g.,
[3]). For every EC $\mathcal{N}$ of $\mathcal{M}$, there exists a finite-memory
strategy $C$ for $\mathcal{M}$ such that $\mathcal{M}$ under $C$ starting from any
state of $\mathcal{N}$ never visits a state outside $\mathcal{N}$ and all states of $\mathcal{N}$
are visited infinitely many times with probability 1. On the
other hand, given any strategy $C$, finite-memory or not, a
state $s$ of $\mathcal{M}$ and a run $\rho$ of $\mathcal{M}$ under $C$ that starts in $s$,
the set of states visited infinitely many times by $\rho$ forms an
end component. Let $ec \subseteq EC(\mathcal{M})$ be the set of all ECs of
$\mathcal{M}$ that correspond, in the above sense, to at least one run
under the strategy $C$ that starts in the state $s$. We say that
the strategy $C$ leads $\mathcal{M}$ from the state $s$ to the set $ec$.

B. Linear Temporal Logic 

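Definition 4: Linear Temporal Logic (LTL) formulae over
a set $AP$ of atomic propositions are formed according to the
following grammar:

$$\phi ::= true \mid a \mid \neg\phi \mid \phi \vee \phi \mid \phi \wedge \phi \mid \mathbf{X}\,\phi \mid \phi\,\mathbf{U}\,\phi \mid \mathbf{G}\,\phi \mid \mathbf{F}\,\phi,$$

where $a \in AP$ is an atomic proposition, $\neg$, $\vee$ and $\wedge$ are
standard Boolean connectives, and $\mathbf{X}$ (next), $\mathbf{U}$ (until), $\mathbf{G}$
(always), and $\mathbf{F}$ (eventually) are temporal operators.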

Formulae of LTL are interpreted over the words from
$(2^{AP})^\omega$, such as those induced by runs of an MDP $\mathcal{M}$ (for
details see, e.g., [3]). For example, a word $w \in (2^{AP})^\omega$
satisfies $\mathbf{G}\,\phi$ and $\mathbf{F}\,\phi$ if $\phi$ holds in $w$ always and eventually,
respectively. If the word induced by a run $\rho \in \mathrm{Run}^{\mathcal{M}}$
satisfies a formula $\phi$, we say that the run $\rho$ satisfies $\phi$. With
slight abuse of notation, we also use states or sets of states
of the MDP as propositions in LTL formulae.

For every LTL formula $\phi$, the set of all runs of $\mathcal{M}$ that
satisfy $\phi$ is measurable in the probability measure $\mathrm{Pr}_s^{\mathcal{M},C}$
for any $C$ and $s$ [3]. With slight abuse of notation, we use
LTL formulae as arguments of $\mathrm{Pr}_s^{\mathcal{M},C}$. If for a state $s \in S$ it
holds that $\mathrm{Pr}_s^{\mathcal{M},C}(\phi) = 1$, we say that the strategy $C$ almost-surely
satisfies $\phi$ starting from $s$. If $\mathcal{M}$ is an initialized MDP
and $\mathrm{Pr}_{s_{init}}^{\mathcal{M},C}(\phi) = 1$, we say that $C$ almost-surely satisfies $\phi$.



The LTL control synthesis problem for an initialized MDP
$\mathcal{M}$ and an LTL formula $\phi$ over $AP$ aims to find a strategy for
$\mathcal{M}$ that almost-surely satisfies $\phi$. This problem can be solved
using principles from probabilistic model checking [3], [12].
The algorithm itself is based on the translation of $\phi$ to a
Rabin automaton and the analysis of an MDP that combines
the Rabin automaton and the original MDP $\mathcal{M}$.

Definition 5: A deterministic Rabin automaton (DRA) is
a tuple $\mathcal{A} = (Q, 2^{AP}, \delta, q_0, Acc)$, where $Q$ is a non-empty
finite set of states, $2^{AP}$ is an alphabet, $\delta : Q \times 2^{AP} \to Q$ is
a transition function, $q_0 \in Q$ is an initial state, and $Acc \subseteq
2^Q \times 2^Q$ is an accepting condition.

A run of $\mathcal{A}$ is a sequence $q_0 q_1 \ldots \in Q^\omega$ such that for every
$i \ge 0$, there exists $A_i \in 2^{AP}$ with $\delta(q_i, A_i) = q_{i+1}$. We say that
the word $A_0 A_1 \ldots \in (2^{AP})^\omega$ induces the run $q_0 q_1 \ldots$. A run
of $\mathcal{A}$ is called accepting if there exists a pair $(B, G) \in Acc$
such that the run visits every state from $B$ only finitely many
times and at least one state from $G$ infinitely many times.
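For an ultimately periodic word (a finite prefix followed by a cycle repeated forever), the Rabin condition can be checked directly, because the states visited infinitely often are exactly those on the periodic part of the induced run. The sketch below is an illustration under that restriction; the function name `accepts_lasso`, the dictionary encoding of $\delta$, and the two-state automaton for $\mathbf{GF}\,sur$ are assumptions of this example.

```python
def accepts_lasso(delta, q0, acc, prefix, cycle):
    """Decide Rabin acceptance of the word prefix . cycle^omega.

    delta: dict (state, letter) -> state, defined for all letters used
    acc:   list of (B, G) pairs of state sets
    cycle: non-empty list of letters
    """
    q = q0
    for letter in prefix:
        q = delta[(q, letter)]
    # Unroll the cycle until the state at a cycle boundary repeats.
    boundary = set()
    while q not in boundary:
        boundary.add(q)
        for letter in cycle:
            q = delta[(q, letter)]
    # q now lies on the periodic part; collect every state seen until it recurs.
    start, infinite = q, set()
    while True:
        for letter in cycle:
            infinite.add(q)
            q = delta[(q, letter)]
        if q == start:
            break
    return any(not (infinite & B) and bool(infinite & G) for B, G in acc)


# Hypothetical DRA for "GF sur": q1 is entered exactly when the letter contains sur.
none, sur = frozenset(), frozenset({"sur"})
delta = {(q, a): ("q1" if a == sur else "q0")
         for q in ("q0", "q1") for a in (none, sur)}
acc = [(set(), {"q1"})]
```

With this automaton, a word whose cycle contains the letter `sur` is accepted and a word whose cycle never contains it is rejected.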

For every LTL formula $\phi$ over $AP$, there exists a DRA $\mathcal{A}_\phi$
such that all and only words from $(2^{AP})^\omega$ satisfying $\phi$ induce
an accepting run of $\mathcal{A}_\phi$ [14]. For translation algorithms see,
e.g., [16], and their online implementations, e.g., [15].

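Definition 6: Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an initialized
MDP and $\mathcal{A} = (Q, 2^{AP}, \delta, q_0, Acc)$ be a DRA.
The product of $\mathcal{M}$ and $\mathcal{A}$ is the initialized MDP
$\mathcal{P} = (S_{\mathcal{P}}, A, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$, where $S_{\mathcal{P}} = S \times Q$,
$P_{\mathcal{P}}((s, q), a, (s', q')) = P(s, a, s')$ if $q' = \delta(q, L(s))$ and
0 otherwise, $AP_{\mathcal{P}} = Q$, $L_{\mathcal{P}}((s, q)) = q$, $g_{\mathcal{P}}((s, q), a) =
g(s, a)$. The initial state of $\mathcal{P}$ is $s_{\mathcal{P}init} = (s_{init}, q_0)$.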
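The product construction of Definition 6 can be sketched as a graph exploration. As an assumption made for brevity, the sketch builds only the part of $S \times Q$ reachable from the initial state, whereas the definition takes the full Cartesian product. The function name `product`, the dictionary encodings, and the toy MDP and automaton are hypothetical.

```python
def product(P, L, g, s_init, delta, q0):
    """Reachable part of the product of an MDP and a DRA.

    P:     dict (s, a) -> {s': probability}
    L:     dict s -> frozenset of atomic propositions
    g:     dict (s, a) -> cost
    delta: dict (q, frozenset) -> q'
    """
    init = (s_init, q0)
    P_prod, g_prod = {}, {}
    seen, frontier = {init}, [init]
    while frontier:
        s, q = frontier.pop()
        q_next = delta[(q, L[s])]             # q' = delta(q, L(s))
        for (t, a), succ in P.items():
            if t != s:
                continue
            targets = {(s2, q_next): p for s2, p in succ.items() if p > 0}
            P_prod[((s, q), a)] = targets
            g_prod[((s, q), a)] = g[(s, a)]   # costs are inherited from the MDP
            for target in targets:
                if target not in seen:
                    seen.add(target)
                    frontier.append(target)
    return seen, P_prod, g_prod, init


# Hypothetical two-state MDP and a two-state automaton tracking the label "sur".
P = {("s0", "a"): {"s0": 0.5, "s1": 0.5}, ("s1", "b"): {"s0": 1.0}}
L = {"s0": frozenset(), "s1": frozenset({"sur"})}
g = {("s0", "a"): 1.0, ("s1", "b"): 2.0}
sur = frozenset({"sur"})
delta = {(q, lab): ("q1" if lab == sur else "q0")
         for q in ("q0", "q1") for lab in (frozenset(), sur)}
states, P_prod, g_prod, init = product(P, L, g, "s0", delta, "q0")
```

Note that the automaton component moves on the label of the state being left, so the pair $(s_1, q_1)$ is never reached in this toy instance even though $s_1$ carries the label.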

Using the projection on the first component, every (finite)
run of $\mathcal{P}$ projects to a (finite) run of $\mathcal{M}$ and vice versa,
for every (finite) run of $\mathcal{M}$, there exists a (finite) run of $\mathcal{P}$
that projects to it. Analogous correspondence exists between
strategies for $\mathcal{P}$ and $\mathcal{M}$. It holds that the projection of a
finite-memory strategy for $\mathcal{P}$ is also finite-memory. More
importantly, for the product $\mathcal{P}$ of an MDP $\mathcal{M}$ and a DRA
$\mathcal{A}_\phi$ for an LTL formula $\phi$, the probability of satisfying the
accepting condition $Acc$ of $\mathcal{A}_\phi$ under a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$
starting from the initial state $s_{\mathcal{P}init}$, i.e.,

$$\mathrm{Pr}_{s_{\mathcal{P}init}}^{\mathcal{P},C_{\mathcal{P}}}\Big(\bigvee_{(B,G) \in Acc} \big(\mathbf{FG}(\neg B) \wedge \mathbf{GF}\,G\big)\Big),$$

is equal to the probability of satisfying the formula $\phi$ in the
MDP $\mathcal{M}$ under the projected strategy $C$ starting from the
initial state $s_{init}$.

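Definition 7: Let $\mathcal{P} = (S_{\mathcal{P}}, A, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ be the
product of an MDP $\mathcal{M}$ and a DRA $\mathcal{A}$. An accepting end
component (AEC) of $\mathcal{P}$ is defined as an end component
$\mathcal{N} = (S_{\mathcal{N}}, A_{\mathcal{N}}, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ of $\mathcal{P}$ for which there
exists a pair $(B, G)$ in the acceptance condition of $\mathcal{A}$ such
that $L_{\mathcal{P}}(S_{\mathcal{N}}) \cap B = \emptyset$ and $L_{\mathcal{P}}(S_{\mathcal{N}}) \cap G \neq \emptyset$. We say that $\mathcal{N}$
is accepting with respect to the pair $(B, G)$. An AEC $\mathcal{N} =
(S_{\mathcal{N}}, A_{\mathcal{N}}, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ is called maximal (MAEC) if
there is no AEC $\mathcal{N}' = (S_{\mathcal{N}'}, A_{\mathcal{N}'}, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ such
that $\mathcal{N}' \neq \mathcal{N}$, $S_{\mathcal{N}} \subseteq S_{\mathcal{N}'}$, $A_{\mathcal{N}}((s, q)) \subseteq A_{\mathcal{N}'}((s, q))$
for every $(s, q) \in S_{\mathcal{N}}$ and $\mathcal{N}$ and $\mathcal{N}'$ are accepting with
respect to the same pair. We use $AEC(\mathcal{P})$ and $MAEC(\mathcal{P})$ to
denote the set of all accepting end components and maximal
accepting end components of $\mathcal{P}$, respectively.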

Note that MAECs that are accepting with respect to the 
same pair are always disjoint. However, MAECs that are 
accepting with respect to different pairs can intersect. 

From the discussion above it follows that a necessary
condition for almost-sure satisfaction of the accepting condition
$Acc$ by a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$ is that there exists a
set $maec \subseteq MAEC(\mathcal{P})$ of MAECs such that $C_{\mathcal{P}}$ leads the
product from the initial state to $maec$.

III. Problem Formulation 

Consider an initialized MDP $\mathcal{M} = (S, A, P, AP, L, g)$
and a specification given as an LTL formula $\phi$ over $AP$ of
the form

$$\phi = \varphi \wedge \mathbf{GF}\,\pi_{sur}, \qquad (1)$$

where $\pi_{sur} \in AP$ is an atomic proposition and $\varphi$ is an
LTL formula over $AP$. Intuitively, a formula of such form
states two partial goals: the mission goal $\varphi$ and the surveillance
goal $\mathbf{GF}\,\pi_{sur}$. To satisfy the whole formula the system
must accomplish the mission and visit the surveillance states
$S_{sur} = \{s \in S \mid \pi_{sur} \in L(s)\}$ infinitely many times.
The motivation for this form of specification comes from
applications in robotics, where persistent surveillance tasks
are often a part of the specification. Note that the form in
Eq. (1) does not restrict the full LTL expressivity since every
LTL formula $\phi_1$ can be translated into a formula $\phi_2$ of the
form in Eq. (1) that is associated with the same set of runs
of $\mathcal{M}$. Explicitly, $\phi_2 = \phi_1 \wedge \mathbf{GF}\,\pi_{sur}$, where $\pi_{sur}$ is such
that $\pi_{sur} \in L(s)$ for every state $s \in S$.

In this work, we focus on a control synthesis problem, 
where the goal is to almost-surely satisfy a given LTL specification,
while optimizing a long-term quantitative objective.
The objective is to minimize the average expected cumulative 
cost between consecutive visits to surveillance states. 

Formally, we say that every visit to a surveillance state
completes a surveillance cycle. In particular, starting from
the initial state, the first visit to $S_{sur}$ completes the first
surveillance cycle of a run. We use $\sharp(\sigma)$ to denote the number
of completed surveillance cycles in a finite run $\sigma$ plus one.
For a strategy $C$ for $\mathcal{M}$, the cumulative cost in the first $n$
stages of applying $C$ to $\mathcal{M}$ starting from a state $s \in S$ is

$$g_{\mathcal{M},C}(s, n) = \sum_{i=0}^{n} g\big(\sigma^n_{\mathcal{M},C}(i),\, C((\sigma^n_{\mathcal{M},C})^{(i)})\big),$$

where $\sigma^n_{\mathcal{M},C}$ is the random variable whose values are finite
runs of length $n + 1$ from the set $\mathrm{Run}_{fin}^{\mathcal{M},C}(s)$ and the
probability of a finite run $\sigma$ is $\mathrm{Pr}_s^{\mathcal{M},C}(Cyl(\sigma))$. Note that
$g_{\mathcal{M},C}(s, n)$ is also a random variable. Finally, we define
the average expected cumulative cost per surveillance cycle
(ACPC) in the MDP $\mathcal{M}$ under a strategy $C$ as a function
$V_{\mathcal{M},C} : S \to \mathbb{R}_0^+$ such that for a state $s \in S$

$$V_{\mathcal{M},C}(s) = \limsup_{n \to \infty} E\left(\frac{g_{\mathcal{M},C}(s, n)}{\sharp(\sigma^n_{\mathcal{M},C})}\right).$$
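To make the ratio inside the expectation concrete, the sketch below evaluates the cumulative cost divided by the number of completed surveillance cycles plus one for a single finite run. It is only an illustration: the function name `cost_per_cycle` and the toy data are hypothetical, and the sketch assumes that the initial position of the run does not itself complete a cycle, a convention the text does not spell out.

```python
def cost_per_cycle(run, strategy, g, surveillance):
    """Cumulative cost of a finite run divided by (completed cycles + 1).

    run:          sequence of states s_0 ... s_n
    strategy:     function mapping a finite run (tuple of states) to an action
    g:            dict (s, a) -> cost
    surveillance: set of surveillance states
    """
    # The sum ranges over i = 0..n, so the action chosen in s_n is included.
    cost = sum(g[(run[i], strategy(tuple(run[: i + 1])))]
               for i in range(len(run)))
    # Assumption: only visits after position 0 complete a surveillance cycle.
    cycles = sum(1 for s in run[1:] if s in surveillance)
    return cost / (cycles + 1)


g = {("s0", "a"): 1.0, ("s1", "b"): 2.0}
choice = {"s0": "a", "s1": "b"}


def memoryless(prefix):
    """Pick the action from the last state of the prefix only."""
    return choice[prefix[-1]]


ratio = cost_per_cycle(["s0", "s0", "s1", "s0"], memoryless, g, {"s1"})
```

Here the run accumulates cost 1 + 1 + 2 + 1 = 5 and completes one cycle, so the ratio is 5 / 2. The ACPC value takes the expectation of this ratio over all runs of length $n + 1$ and then the limit superior over $n$.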



The problem we consider in this paper can be formally stated 
as follows. 

Problem 1: Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an initialized
MDP and $\phi$ be an LTL formula over $AP$ of the form in
Eq. (1). Find a strategy $C$ for $\mathcal{M}$ such that $C$ almost-surely
satisfies $\phi$ and, at the same time, $C$ minimizes the
ACPC value $V_{\mathcal{M},C}(s_{init})$ among all strategies almost-surely
satisfying $\phi$.

The above problem was recently investigated in [11].
However, the solution presented by the authors is guaranteed
to find an optimal strategy only if every MAEC $\mathcal{N}$ of the
product $\mathcal{P}$ of the MDP $\mathcal{M}$ and the DRA for the specification
satisfies certain conditions (for details see [11]). In this paper,
we present a solution to Problem 1 that always finds an
optimal strategy if one exists. The algorithm is based on
principles from probabilistic model checking [3] and game
theory [5], whereas the authors in [11] mainly use results
from dynamic programming [4].

In the special case when every state of $\mathcal{M}$ is a surveillance
state, Problem 1 aims to find a strategy that minimizes
the average expected cost per stage among all strategies
almost-surely satisfying $\phi$. The problem of minimizing the
average expected cost per stage (ACPS) in an MDP, without
considering any correctness specification, is a well-studied
problem in optimal control, see, e.g., [4]. It holds that
there always exists a stationary strategy that minimizes the
ACPS value starting from the initial state. In our approach to
Problem 1, we use techniques for solving the ACPS problem
to find a strategy that minimizes the ACPC value.

IV. Solution 

Let $\mathcal{M} = (S, A, P, AP, L, g)$ be an initialized MDP and
$\phi$ an LTL formula over $AP$ of the form in Eq. (1). To
solve Problem 1 for $\mathcal{M}$ and $\phi$, we leverage ideas from
game theory [5] and construct an optimal strategy for $\mathcal{M}$
as a combination of a strategy that ensures the almost-sure
satisfaction of the specification $\phi$ and a strategy that
guarantees the minimum ACPC value among all strategies
that do not cause immediate unrepairable violation of $\phi$.

The algorithm we present in this section works with the
product $\mathcal{P} = (S_{\mathcal{P}}, A, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ of the MDP $\mathcal{M}$ and
a deterministic Rabin automaton $\mathcal{A}_\phi = (Q, 2^{AP}, \delta, q_0, Acc)$
for the formula $\phi$. We inherit the notion of a surveillance
cycle in $\mathcal{P}$ by adding the proposition $\pi_{sur}$ to the set $AP_{\mathcal{P}}$
and to the set $L_{\mathcal{P}}((s, q))$ for every $(s, q) \in S_{\mathcal{P}}$ such that
$\pi_{sur} \in L(s)$. Using the correspondence between strategies
for $\mathcal{P}$ and $\mathcal{M}$, an optimal strategy $C$ for $\mathcal{M}$ is found as
a projection of a strategy $C_{\mathcal{P}}$ for $\mathcal{P}$ which almost-surely
satisfies the accepting condition $Acc$ of $\mathcal{A}_\phi$ and, at the same
time, minimizes the ACPC value $V_{\mathcal{P},C_{\mathcal{P}}}(s_{\mathcal{P}init})$ among all
strategies for $\mathcal{P}$ that almost-surely satisfy $Acc$.

Since $C_{\mathcal{P}}$ must almost-surely satisfy the accepting condition
$Acc$, it leads from the initial state of $\mathcal{P}$ to a set of
MAECs. For every MAEC $\mathcal{N}$, the minimum ACPC value
$V^*_{\mathcal{N}}((s, q))$ that can be obtained in $\mathcal{N}$ starting from a state
$(s, q) \in S_{\mathcal{N}}$ is equal for all the states of $\mathcal{N}$ and we denote
this value $V^*_{\mathcal{N}}$. The strategy $C_{\mathcal{P}}$ is constructed in two steps.



First, we find a set $maec^*$ of MAECs of $\mathcal{P}$ and a strategy
$C_0$ that leads $\mathcal{P}$ from the initial state to the set $maec^*$. We
require that $C_0$ and $maec^*$ minimize the weighted average
of the values $V^*_{\mathcal{N}}$ for $\mathcal{N} \in maec^*$. The strategy $C_{\mathcal{P}}$ applies
$C_0$ from the initial state until $\mathcal{P}$ enters the set $maec^*$.

Second, we solve the problem of how to control the
product once a state of an MAEC $\mathcal{N} \in maec^*$ is visited.
Intuitively, we combine two finite-memory strategies, $C^{\phi}_{\mathcal{N}}$ for
the almost-sure satisfaction of the accepting condition $Acc$
and $C^{V}_{\mathcal{N}}$ for maintaining the average expected cumulative
cost per surveillance cycle. To satisfy both objectives, the
strategy $C_{\mathcal{P}}$ is played in rounds. In each round, we first
apply the strategy $C^{\phi}_{\mathcal{N}}$ and then the strategy $C^{V}_{\mathcal{N}}$, each for a
specific (finite) number of steps.

A. Finding an optimal set of MAECs 

Let $MAEC(\mathcal{P})$ be the set of all MAECs of the product
$\mathcal{P}$, which can be computed as follows. For every pair $(B, G) \in
Acc$, we create a new MDP from $\mathcal{P}$ by removing all its
states with label in $B$ and the corresponding actions. For the
new MDP, we use one of the algorithms in [10], [9], [7] to
compute the set of all its MECs. Finally, for every MEC, we
check whether it contains a state with label in $G$.
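The three steps above can be sketched as follows. The MEC decomposition used here is a naive textbook refinement (drop actions that may leave the candidate set, split into strongly connected components, repeat) and is not claimed to be any of the algorithms of [10], [9], [7]; the function names, the dictionary encoding, and the three-state example are assumptions of this sketch.

```python
def reachable(start, succ):
    """States reachable from `start` in the graph succ: state -> set of states."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for w in succ[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def mecs(states, P):
    """Maximal end components of the sub-MDP induced by `states`.

    P: dict (s, a) -> {s': probability}. Returns a list of (state set, actions per state).
    """
    result, work = [], [frozenset(states)]
    while work:
        T = work.pop()
        act = {s: set() for s in T}
        for (s, a), succ in P.items():           # keep actions that cannot leave T
            if s in T and all(s2 in T for s2, p in succ.items() if p > 0):
                act[s].add(a)
        alive = frozenset(s for s in T if act[s])
        if alive != T:                            # some state lost all its actions
            if alive:
                work.append(alive)
            continue
        graph = {s: {s2 for a in act[s] for s2, p in P[(s, a)].items() if p > 0}
                 for s in T}
        reach = {s: reachable(s, graph) for s in T}
        comps, left = [], set(T)
        while left:                               # naive SCC split via mutual reachability
            s = next(iter(left))
            comp = frozenset(t for t in reach[s] if s in reach[t])
            comps.append(comp)
            left -= comp
        if len(comps) == 1:
            result.append((T, act))
        else:
            work.extend(comps)
    return result


def maecs(states, P, label, acc):
    """MAECs: for each pair (B, G), MECs avoiding B that contain a G-labeled state."""
    found = []
    for B, G in acc:
        keep = {s for s in states if label[s] not in B}
        for T, act in mecs(keep, P):
            if any(label[s] in G for s in T):
                found.append((T, act, (B, G)))
    return found


# Hypothetical product with three states and one Rabin pair.
P = {(0, "a"): {1: 1.0}, (1, "a"): {0: 0.5, 2: 0.5}, (1, "b"): {0: 1.0}, (2, "a"): {2: 1.0}}
label = {0: "q0", 1: "q1", 2: "q2"}
acc = [({"q2"}, {"q1"})]
found = maecs({0, 1, 2}, P, label, acc)
```

Removing the states labeled in $B$ also removes every action that may enter them, because such an action fails the "cannot leave T" test; this mirrors the removal of "the corresponding actions" in the description above.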

In this section, the aim is to find a set $maec^* \subseteq MAEC(\mathcal{P})$
and a strategy $C_0$ for $\mathcal{P}$ that satisfy conditions formally stated
below. Since the strategy $C_0$ will only be used to enter the
set $maec^*$, it is constructed as a partial function.

Definition 8: A partial strategy $\zeta$ for the MDP $\mathcal{M}$ is a
partial function $\zeta : \mathrm{Run}_{fin}^{\mathcal{M}} \to A$, where if $\zeta(\sigma)$ is defined for
$\sigma \in \mathrm{Run}_{fin}^{\mathcal{M}}$, then $\zeta(\sigma) \in A(last(\sigma))$.

A partial stationary strategy for $\mathcal{M}$ can also be considered
as a partial function $\zeta : S \to A$ or a subset $\zeta \subseteq S \times A$. The set
$\mathrm{Run}^{\mathcal{M},\zeta}$ of runs of $\mathcal{M}$ under $\zeta$ contains all infinite runs of
$\mathcal{M}$ that follow $\zeta$ and all those finite runs $\sigma$ of $\mathcal{M}$ under $\zeta$ for
which $\zeta(last(\sigma))$ is not defined. A finite run of $\mathcal{M}$ under $\zeta$ is
then a finite prefix of a run under $\zeta$. The probability measure
$\mathrm{Pr}_s^{\mathcal{M},\zeta}$ is defined in the same manner as in Sec. II-A. We
also extend the semantics of LTL formulas to finite words.
For example, a formula $\mathbf{FG}\,\phi$ is satisfied by a finite word if
$\phi$ always holds in some non-empty suffix of the word.

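The conditions on $maec^*$ and $C_0$ are as follows. First, the
partial strategy $C_0$ leads $\mathcal{P}$ to the set $maec^*$, i.e.,

$$\mathrm{Pr}_{s_{\mathcal{P}init}}^{\mathcal{P},C_0}\Big(\mathbf{FG}\big(\bigcup_{\mathcal{N} \in maec^*} S_{\mathcal{N}}\big)\Big) = 1. \qquad (2)$$

Second, we require that $maec^*$ and $C_0$ minimize the value

$$\sum_{\mathcal{N} \in maec^*} \mathrm{Pr}_{s_{\mathcal{P}init}}^{\mathcal{P},C_0}(\mathbf{FG}\,S_{\mathcal{N}}) \cdot V^*_{\mathcal{N}}. \qquad (3)$$
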
The procedure to compute the optimal ACPC value for
an MAEC $\mathcal{N}$ of $\mathcal{P}$ is described in the next section. Assume
we already computed this value for each MAEC of $\mathcal{P}$. The
algorithm to find the set $maec^*$ and partial strategy $C_0$ is
based on an algorithm for the stochastic shortest path (SSP)
problem. The SSP problem is one of the basic optimization
problems for MDPs. Given an initialized MDP and its state
$t$, the goal is to find a strategy under which the MDP almost-surely
reaches the state $t$, the so-called terminal state, while
minimizing the expected cumulative cost. If there exists at
least one strategy almost-surely reaching the terminal state,
then there exists a stationary optimal strategy. For details and
algorithms see, e.g., [4].

The partial strategy $C_0$ and the set $maec^*$ are computed as
follows. First, we create a new MDP $\mathcal{P}'$ from $\mathcal{P}$ by considering
only those states of $\mathcal{P}$ that can reach the set $MAEC(\mathcal{P})$
with probability 1 and their corresponding actions. The MDP
$\mathcal{P}'$ can be computed using backward reachability from the set
$MAEC(\mathcal{P})$. If $\mathcal{P}'$ does not contain the initial state $s_{\mathcal{P}init}$,
there exists no solution to Problem 1. Otherwise, we add
a new state $t$ and for every MAEC $\mathcal{N} \in MAEC(\mathcal{P}') =
MAEC(\mathcal{P})$, we add a new action $a_{\mathcal{N}}$ to $\mathcal{P}'$. From each state
$(s, q) \in S_{\mathcal{N}}$, $\mathcal{N} \in MAEC(\mathcal{P}')$, we define a transition under
$a_{\mathcal{N}}$ to $t$ with probability 1 and set its cost to $V^*_{\mathcal{N}}$. All other
costs in the MDP are set to 0. Finally, we solve the SSP
problem for $\mathcal{P}'$ and the state $t$ as the terminal state. Let
$C_{ssp}$ be the resulting stationary optimal strategy for $\mathcal{P}'$.
For every $(s, q) \in S_{\mathcal{P}}$, we define $C_0((s, q)) = C_{ssp}((s, q))$
if the action $C_{ssp}((s, q))$ does not lead from $(s, q)$ to $t$;
$C_0((s, q))$ is undefined otherwise. The set $maec^*$ is the set
of all MAECs $\mathcal{N}$ for which there exists a state $(s, q)$ such
that $C_{ssp}((s, q)) = a_{\mathcal{N}}$.
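The construction of the SSP instance and the extraction of $C_0$ and $maec^*$ from its solution can be sketched as below. The sketch deliberately does not solve the SSP problem: it assumes a stationary strategy `c_ssp` is supplied by some SSP solver, and it only evaluates that strategy. All names (`build_ssp`, `extract`, `evaluate`, the `("exit", name)` encoding of the actions $a_{\mathcal{N}}$, the terminal state `"t"`) and the numbers are hypothetical.

```python
def build_ssp(P, maecs, value):
    """Add terminal state "t" and one exit action per MAEC.

    P:     dict (s, a) -> {s': probability}, already restricted to states that
           reach the MAECs with probability 1 (assumed, not checked here)
    maecs: dict name -> set of states of that MAEC
    value: dict name -> optimal ACPC value of that MAEC
    """
    P2 = {key: dict(succ) for key, succ in P.items()}
    cost = {key: 0.0 for key in P}                 # all original costs become 0
    for name, S_N in maecs.items():
        for s in S_N:
            P2[(s, ("exit", name))] = {"t": 1.0}   # action a_N leads to t
            cost[(s, ("exit", name))] = value[name]
    return P2, cost


def extract(c_ssp):
    """Split a stationary SSP strategy into the partial strategy C_0 and maec*."""
    def is_exit(a):
        return isinstance(a, tuple) and a[0] == "exit"
    c0 = {s: a for s, a in c_ssp.items() if not is_exit(a)}
    chosen = {a[1] for a in c_ssp.values() if is_exit(a)}
    return c0, chosen


def evaluate(P2, cost, strategy, sweeps=100):
    """Expected cumulative cost of a strategy that reaches "t" almost surely."""
    J = {s: 0.0 for s in strategy}
    J["t"] = 0.0
    for _ in range(sweeps):
        for s, a in strategy.items():
            J[s] = cost[(s, a)] + sum(p * J[s2] for s2, p in P2[(s, a)].items())
    return J


# Hypothetical instance: initial state "i", MAECs N1 = {m1} and N2 = {m2}.
P = {("i", "l"): {"m1": 1.0}, ("i", "r"): {"m1": 0.5, "m2": 0.5},
     ("m1", "stay"): {"m1": 1.0}, ("m2", "stay"): {"m2": 1.0}}
P2, cost = build_ssp(P, {"N1": {"m1"}, "N2": {"m2"}}, {"N1": 3.0, "N2": 1.0})
c_ssp = {"i": "r", "m1": ("exit", "N1"), "m2": ("exit", "N2")}   # assumed solver output
c0, chosen = extract(c_ssp)
J = evaluate(P2, cost, c_ssp)
```

In this instance the value at the initial state is 0.5 · 3 + 0.5 · 1 = 2, which is the weighted sum in Eq. (3) for the chosen MAECs. Because all original costs are zero, a strategy that never exits also has zero cost, so an SSP solver used here has to restrict attention to strategies that reach $t$ almost surely.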

Proposition 1: The set $maec^*$ and the partial stationary
strategy $C_0$ resulting from the above algorithm satisfy the
conditions in Eq. (2) and Eq. (3).

Proof: Both conditions follow directly from the fact
that the strategy $C_{ssp}$ is an optimal solution to the SSP
problem for $\mathcal{P}'$ and $t$. ■

B. Optimizing ACPC value in an MAEC 

In this section, we compute the minimum ACPC value
that can be attained in an MAEC $\mathcal{N} \in MAEC(\mathcal{P})$ and
construct the corresponding strategy for $\mathcal{N}$. Essentially, we
reduce the problem of computing the minimum ACPC value
to the problem of computing the minimum ACPS value by
reducing $\mathcal{N}$ to an MDP such that every state of the reduced
MDP is labeled with the surveillance proposition $\pi_{sur}$.

Let $\mathcal{N} = (S_{\mathcal{N}}, A_{\mathcal{N}}, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ be an MAEC of
$\mathcal{P}$. Since it is an MAEC, there exists a state $(s, q) \in S_{\mathcal{N}}$
with $\pi_{sur} \in L_{\mathcal{P}}((s, q))$. Let $S_{\mathcal{N}_{sur}}$ denote the set of all such
states in $S_{\mathcal{N}}$. We reduce $\mathcal{N}$ to an MDP

$$\mathcal{N}_{sur} = (S_{\mathcal{N}_{sur}}, A_{sur}, P_{sur}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{sur})$$

using Alg. 1. For the sake of readability, we use singletons
such as $v$ instead of pairs such as $(s, q)$ to denote the states
of $\mathcal{N}$. The MDP $\mathcal{N}_{sur}$ is constructed from $\mathcal{N}$ by eliminating
states from $S_{\mathcal{N}} \setminus S_{\mathcal{N}_{sur}}$ one by one in arbitrary order. The
actions $A_{sur}$ are partial stationary strategies for $\mathcal{N}$ in which
we remember all the states and actions we eliminated. Later
we prove that the transition probability $P_{sur}(v, \zeta, v')$ for
states $v, v' \in S_{\mathcal{N}_{sur}}$ and an action $\zeta \in A_{sur}(v)$ is the
probability that in $\mathcal{N}$ under the partial stationary strategy $\zeta$,
if we start from the state $v$, the next state that will be visited
from the set $S_{\mathcal{N}_{sur}}$ is the state $v'$, i.e., the first surveillance
cycle is completed by visiting $v'$. The cost $g_{sur}(v, \zeta)$ is the
expected cumulative cost gained in $\mathcal{N}$ using partial stationary
strategy $\zeta$ from $v$ until we reach a state in $S_{\mathcal{N}_{sur}}$.

In Fig. 1, we demonstrate the reduction on an example
using the notation introduced in Alg. 1. On the left side, we
see a part of an MAEC $\mathcal{N}$ with five states and two actions.
First, we build an MDP $\mathcal{X} = (S_{\mathcal{X}}, A_{\mathcal{X}}, P_{\mathcal{X}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{X}})$
from $\mathcal{N}$ by transforming every action of every state to a
partial stationary strategy with a single pair given by the
state and the action. The MDP $\mathcal{X}$ is used in the algorithm as
an auxiliary MDP to store the current version of the reduced
system. Assume we want to reduce the state $v$. We consider
all "incoming" and "outgoing" actions of $v$ and combine
them pairwise as follows. There is only one outgoing action
from $v$ in $\mathcal{X}$, namely $\zeta$, and only one incoming action,
namely action $\zeta_{old}$ of state $v_{from}$. Since $\zeta$ and $\zeta_{old}$ do not
conflict as partial stationary strategies on any state of $\mathcal{N}$,
we merge them to create a new partial stationary strategy
$\zeta_{new}$ that is an action of $v_{from}$. The transition probability
$P_{\mathcal{X}}(v_{from}, \zeta_{new}, v_{to})$ for a state $v_{to}$ of $\mathcal{X}$ is computed as
the sum of the transition probability $P_{\mathcal{X}}(v_{from}, \zeta_{old}, v_{to})$ of
transiting from $v_{from}$ to $v_{to}$ using the old action $\zeta_{old}$ and
the probability of entering $v_{to}$ by first transiting from $v_{from}$
to $v$ using $\zeta_{old}$ and from $v$ eventually reaching $v_{to}$ using $\zeta$.
The cost $g_{\mathcal{X}}(v_{from}, \zeta_{new})$ is the expected cumulative cost
gained starting from $v_{from}$ by first applying action $\zeta_{old}$
and, if we transit to $v$, applying $\zeta$ until a state different
from $v$ is reached. Now that we considered every pair of
an incoming and outgoing action of $v$, the state $v$ and its
incoming and outgoing actions are reduced. The modified
MDP $\mathcal{X}$ is depicted on the right side of Fig. 1.

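Proposition 2: Let $\mathcal{N} = (S_{\mathcal{N}}, A_{\mathcal{N}}, P_{\mathcal{P}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{P}})$ be
an MAEC and $\mathcal{N}_{sur} = (S_{\mathcal{N}_{sur}}, A_{sur}, P_{sur}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{sur})$
its reduction resulting from Alg. 1. The minimum ACPC
value that can be attained in $\mathcal{N}_{sur}$ starting from any of its
states is the same and we denote it $V^*_{\mathcal{N}_{sur}}$. There exists a
stationary strategy $C^{V}_{\mathcal{N}_{sur}}$ for $\mathcal{N}_{sur}$ that attains this value
regardless of the starting state in $\mathcal{N}_{sur}$. Both $V^*_{\mathcal{N}_{sur}}$ and $C^{V}_{\mathcal{N}_{sur}}$
can be computed as a solution to the ACPS problem for $\mathcal{N}_{sur}$.
It holds that $V^*_{\mathcal{N}} = V^*_{\mathcal{N}_{sur}}$ and from $C^{V}_{\mathcal{N}_{sur}}$, one can construct
a finite-memory strategy $C^{V}_{\mathcal{N}}$ for $\mathcal{N}$ which regardless of the
starting state in $\mathcal{N}$ attains the optimal ACPC value $V^*_{\mathcal{N}}$.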
Proof: We prove the following correspondence between
$\mathcal{N}$ and $\mathcal{N}_{sur}$. For every $v, v' \in S_{\mathcal{N}_{sur}}$ and $\zeta \in A_{sur}(v)$, it
holds that $\zeta$ is a well-defined partial stationary strategy for
$\mathcal{N}$. The transition probability $P_{sur}(v, \zeta, v')$ is the probability
that in $\mathcal{N}$, when applying $\zeta$ starting from $v$, the first
surveillance cycle is completed by visiting $v'$, i.e.,

$$P_{sur}(v, \zeta, v') = \mathrm{Pr}_v^{\mathcal{N},\zeta}\big(\mathbf{X}(\neg S_{\mathcal{N}_{sur}}\,\mathbf{U}\,v')\big).$$

The cost $g_{sur}(v, \zeta)$ is the expected cumulative cost gained in
$\mathcal{N}$ when applying $\zeta$ starting from $v$ until the first surveillance
cycle is completed. On the other hand, for every partial
stationary strategy $\zeta$ for $\mathcal{N}$ such that

$$\mathrm{Pr}_v^{\mathcal{N},\zeta}(\mathbf{F}\,S_{\mathcal{N}_{sur}}) = 1$$

for some $v \in S_{\mathcal{N}_{sur}}$, there exists an action $\zeta' \in A_{sur}(v)$
such that the action $\zeta'$ corresponds to the partial stationary
strategy $\zeta$ in the above sense, i.e.,

$$P_{sur}(v, \zeta', v') = \mathrm{Pr}_v^{\mathcal{N},\zeta}\big(\mathbf{X}(\neg S_{\mathcal{N}_{sur}}\,\mathbf{U}\,v')\big)$$

for every $v' \in S_{\mathcal{N}_{sur}}$, and the cost $g_{sur}(v, \zeta')$ is the expected
cumulative cost gained in $\mathcal{N}$ when we apply $\zeta$ starting from
$v$ until we reach a state in $S_{\mathcal{N}_{sur}}$.

[Figure 1, panels: a part of $\mathcal{N}$; building $\mathcal{X}$; removing $v$.]
Fig. 1: Illustration of Alg. 1. A part of an MAEC $\mathcal{N}$ is shown on the left. An auxiliary MDP $\mathcal{X}$ is constructed by transforming actions
of $\mathcal{N}$ to partial stationary strategies. The MDP $\mathcal{X}$ after eliminating the state $v$ is shown on the right. The costs associated with actions
are depicted in blue.

To prove the first part of the correspondence above,
we prove the following invariant of Alg. 1. Let $\mathcal{X} =
(S_{\mathcal{X}}, A_{\mathcal{X}}, P_{\mathcal{X}}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{X}})$ be the MDP from the algorithm
after the initialization, before the first iteration of the
while cycle. It is easy to see that all actions of $\mathcal{X}$ are
well-defined partial stationary strategies. For the transition
probabilities, it holds that

$$P_{\mathcal{X}}(v_{from}, \zeta, v_{to}) = \mathrm{Pr}_{v_{from}}^{\mathcal{N},\zeta}\big(\mathbf{X}(\neg S_{\mathcal{X}}\,\mathbf{U}\,v_{to})\big)$$

for every $v_{from}, v_{to} \in S_{\mathcal{X}}$ and $\zeta \in A_{\mathcal{X}}(v_{from})$. The cost
$g_{\mathcal{X}}(v_{from}, \zeta)$ is the expected cumulative cost gained in $\mathcal{N}$
starting from $v_{from}$ when applying $\zeta$ until we reach a state
in $S_{\mathcal{X}}$. We show that these conditions also hold after every
iteration of the while cycle.

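Let $\mathcal{X}$ satisfy the conditions above and let $v \in S_{\mathcal{X}} \setminus S_{\mathcal{N}_{sur}}$.
By removing the state $v$ from $S_{\mathcal{X}}$, we obtain a new version
of the MDP $\mathcal{X}' = (S_{\mathcal{X}'}, A_{\mathcal{X}'}, P_{\mathcal{X}'}, AP_{\mathcal{P}}, L_{\mathcal{P}}, g_{\mathcal{X}'})$. Note
that $S_{\mathcal{X}'} \cup \{v\} = S_{\mathcal{X}}$. Let $v_{from} \in S_{\mathcal{X}'}$ be a state of $\mathcal{X}'$ and
$\zeta_{new} \in A_{\mathcal{X}'}(v_{from})$ be its action such that $\zeta_{new}$ has changed
in the process of removing the state $v$. The action $\zeta_{new}$ is a
well-defined partial stationary strategy because it must have
been created as a union of an action $\zeta_{old}$ of $v_{from}$ and an
action $\zeta$ of $v$, both from the previous version $\mathcal{X}$, which do
not conflict on any state from $S_{\mathcal{X}}$.

Let $\xrightarrow{\mathcal{X}'} v_{to}$ denote the LTL formula $\mathbf{X}(\neg S_{\mathcal{X}'}\,\mathbf{U}\,v_{to})$. For a
state $v_{to} \in S_{\mathcal{X}'}$, we prove that

$$P_{\mathcal{X}'}(v_{from}, \zeta_{new}, v_{to}) = \mathrm{Pr}_{v_{from}}^{\mathcal{N},\zeta_{new}}\big(\xrightarrow{\mathcal{X}'} v_{to}\big).$$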



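Since $\zeta_{new} = \zeta_{old} \cup \zeta$, the probability in $\mathcal{N}$ when applying
$\zeta_{new}$ starting from $v_{from}$ of reaching the state $v_{to}$ as the
next state in $S_{\mathcal{X}'}$ is the probability of reaching it as the next
state in $S_\mathcal{X}$ when using $\zeta_{old}$ from $v_{from}$, plus the probability
of reaching $v$ as the next state in $S_\mathcal{X}$ from $v_{from}$ using $\zeta_{old}$
and then eventually reaching the state $v_{to}$ from $v$ using $\zeta$.
This means

$$\Pr^{\mathcal{N},\zeta_{new}}_{v_{from}}\big(\xrightarrow{\mathcal{X}'} v_{to}\big) = P_\mathcal{X}(v_{from}, \zeta_{old}, v_{to}) + P_\mathcal{X}(v_{from}, \zeta_{old}, v) \cdot \sum_{i=0}^{\infty} \big(P_\mathcal{X}(v, \zeta, v)^i \cdot P_\mathcal{X}(v, \zeta, v_{to})\big)$$

$$= P_\mathcal{X}(v_{from}, \zeta_{old}, v_{to}) + P_\mathcal{X}(v_{from}, \zeta_{old}, v) \cdot \frac{P_\mathcal{X}(v, \zeta, v_{to})}{1 - P_\mathcal{X}(v, \zeta, v)},$$

which is exactly as defined in Alg. 1.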

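Similarly, we prove that $g_{\mathcal{X}'}(v_{from}, \zeta_{new})$ is the expected
cumulative cost gained in $\mathcal{N}$ starting from $v_{from}$ when
applying $\zeta_{new}$ until we reach a state in $S_{\mathcal{X}'}$. As $\zeta_{new} = \zeta_{old} \cup \zeta$,
it is the expected cumulative cost of reaching a
state in $S_\mathcal{X}$ by using $\zeta_{old}$ plus, in the case we reach $v$, the
expected cumulative cost of eventually reaching a state in
$S_{\mathcal{X}'}$, i.e., other than $v$, using $\zeta$. To be specific, we have

$$g_{\mathcal{X}'}(v_{from}, \zeta_{new}) = g_\mathcal{X}(v_{from}, \zeta_{old}) + P_\mathcal{X}(v_{from}, \zeta_{old}, v) \cdot \sum_{i=0}^{\infty} \big(P_\mathcal{X}(v, \zeta, v)^i \cdot (1 - P_\mathcal{X}(v, \zeta, v)) \cdot (i+1) \cdot g_\mathcal{X}(v, \zeta)\big)$$

$$= g_\mathcal{X}(v_{from}, \zeta_{old}) + P_\mathcal{X}(v_{from}, \zeta_{old}, v) \cdot \frac{g_\mathcal{X}(v, \zeta)}{1 - P_\mathcal{X}(v, \zeta, v)},$$

just as defined in Alg. 1. This completes the proof of the
first part of the correspondence between $\mathcal{N}$ and $\mathcal{N}_{sur}$.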

The second part of the correspondence between $\mathcal{N}$ and
$\mathcal{N}_{sur}$ follows directly from the fact that, in the process of removing
a state $v \in S_\mathcal{X} \setminus S_{\mathcal{N}_{sur}}$, we combine every action
of $v$ that eventually reaches a state different from
$v$ with every action of every state $v_{from}$ under
which $v$ is reached with non-zero probability.

From the correspondence between $\mathcal{N}$ and $\mathcal{N}_{sur}$ it follows
that in $\mathcal{N}_{sur}$, there exists a finite run between every two
states. Therefore, the minimum ACPC value that can be
obtained in $\mathcal{N}_{sur}$ from any of its states is the same and it is
denoted by $V^*_{\mathcal{N}_{sur}}$. Since every state of $\mathcal{N}_{sur}$ is a surveillance



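Algorithm 1 Reduction of an MAEC $\mathcal{N}$ to $\mathcal{N}_{sur}$

Input: $\mathcal{N} = (S_\mathcal{N}, A_\mathcal{N}, P_\mathcal{N}, AP_\mathcal{P}, L_\mathcal{P}, g_\mathcal{N})$
Output: $\mathcal{N}_{sur} = (S_{\mathcal{N}_{sur}}, A_{sur}, P_{sur}, AP_\mathcal{P}, L_\mathcal{P}, g_{sur})$

let $\mathcal{X} = (S_\mathcal{X}, A_\mathcal{X}, P_\mathcal{X}, AP_\mathcal{P}, L_\mathcal{P}, g_\mathcal{X})$ be an MDP where
    • $S_\mathcal{X} := S_\mathcal{N}$,
    • for $v \in S_\mathcal{X}$: $A_\mathcal{X}(v) := \{\zeta_\alpha \mid \zeta_\alpha = \{(v, \alpha)\}, \alpha \in A_\mathcal{N}(v)\}$,
    • for $v, v' \in S_\mathcal{X}$, $\zeta \in A_\mathcal{X}$: $P_\mathcal{X}(v, \zeta, v') := P_\mathcal{N}(v, \zeta(v), v')$,
    • for $v \in S_\mathcal{X}$, $\zeta \in A_\mathcal{X}$: $g_\mathcal{X}(v, \zeta) := g_\mathcal{N}(v, \zeta(v))$
while $S_\mathcal{X} \setminus S_{\mathcal{N}_{sur}} \neq \emptyset$ do
    let $v \in S_\mathcal{X} \setminus S_{\mathcal{N}_{sur}}$
    for all $\zeta \in A_\mathcal{X}(v)$ do
        if $P_\mathcal{X}(v, \zeta, v) < 1$ then
            for all $v_{from} \in S_\mathcal{X}$, $\zeta_{old} \in A_\mathcal{X}(v_{from})$ do
                if $P_\mathcal{X}(v_{from}, \zeta_{old}, v) > 0$ and $\zeta_{old}, \zeta$ do not conflict for any state from $S_\mathcal{X}$ then
                    $\zeta_{new} := \zeta_{old} \cup \zeta$
                    add $\zeta_{new}$ to $A_\mathcal{X}(v_{from})$
                    for every $v_{to} \in S_\mathcal{X}$:
                        $P_\mathcal{X}(v_{from}, \zeta_{new}, v_{to}) := P_\mathcal{X}(v_{from}, \zeta_{old}, v_{to}) + P_\mathcal{X}(v_{from}, \zeta_{old}, v) \cdot \frac{P_\mathcal{X}(v, \zeta, v_{to})}{1 - P_\mathcal{X}(v, \zeta, v)}$
                    $g_\mathcal{X}(v_{from}, \zeta_{new}) := g_\mathcal{X}(v_{from}, \zeta_{old}) + P_\mathcal{X}(v_{from}, \zeta_{old}, v) \cdot \frac{g_\mathcal{X}(v, \zeta)}{1 - P_\mathcal{X}(v, \zeta, v)}$
                    remove $\zeta_{old}$ from $A_\mathcal{X}(v_{from})$
                end if
            end for
        end if
        remove $\zeta$ from $A_\mathcal{X}(v)$
    end for
    remove $v$ from $S_\mathcal{X}$
end while
return $\mathcal{X}$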
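The elimination step of Alg. 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionary-based MDP encoding and the names `conflict` and `eliminate` are assumptions made for illustration. As a further simplification, the sketch first collects all merged actions of the predecessors of the removed state and only then discards the actions that still lead to it.

```python
def conflict(z1, z2):
    """True if two partial strategies prescribe different actions for some state."""
    d1, d2 = dict(z1), dict(z2)
    return any(s in d2 and d2[s] != a for s, a in d1.items())


def eliminate(states, actions, P, g, v):
    """Remove state v in place, folding its actions into those of its predecessors.

    states  : set of states (hypothetical encoding)
    actions : dict state -> set of partial strategies,
              each a frozenset of (state, action) pairs
    P       : dict (state, strategy) -> dict next_state -> probability
    g       : dict (state, strategy) -> expected cumulative cost
    """
    others = states - {v}
    new_P, new_g, stale = {}, {}, []
    for v_from in others:
        for z_old in actions[v_from]:
            p_to_v = P[(v_from, z_old)].get(v, 0.0)
            if p_to_v == 0.0:
                continue
            stale.append((v_from, z_old))  # still points at v, dropped later
            for zeta in actions[v]:
                p_loop = P[(v, zeta)].get(v, 0.0)
                if p_loop >= 1.0 or conflict(z_old, zeta):
                    continue
                # Closed form of the geometric series over self-loops in v.
                scale = p_to_v / (1.0 - p_loop)
                z_new = z_old | zeta
                new_P[(v_from, z_new)] = {
                    v_to: P[(v_from, z_old)].get(v_to, 0.0)
                    + scale * P[(v, zeta)].get(v_to, 0.0)
                    for v_to in others
                }
                new_g[(v_from, z_new)] = g[(v_from, z_old)] + scale * g[(v, zeta)]
    for v_from, z_old in stale:
        actions[v_from].discard(z_old)
        P.pop((v_from, z_old))
        g.pop((v_from, z_old))
    for (v_from, z_new), dist in new_P.items():
        actions[v_from].add(z_new)
        P[(v_from, z_new)] = dist
        g[(v_from, z_new)] = new_g[(v_from, z_new)]
    for zeta in actions.pop(v):
        P.pop((v, zeta))
        g.pop((v, zeta))
    states.discard(v)


# Toy instance (made up for illustration): a -> {v, b}, v -> {v, b}, b -> a.
A = frozenset({("a", "alpha")})
V = frozenset({("v", "beta")})
B = frozenset({("b", "gamma")})
states = {"a", "v", "b"}
actions = {"a": {A}, "v": {V}, "b": {B}}
P = {("a", A): {"v": 0.5, "b": 0.5}, ("v", V): {"v": 0.5, "b": 0.5}, ("b", B): {"a": 1.0}}
g = {("a", A): 2.0, ("v", V): 1.0, ("b", B): 0.0}
eliminate(states, actions, P, g, "v")
```

In the toy instance, the merged action of state `a` reaches `b` with probability 0.5 + 0.5 · 0.5 / (1 − 0.5) = 1 and has cost 2 + 0.5 · 1 / (1 − 0.5) = 3, matching the two update rules of the algorithm.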



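The strategy $C^V_\mathcal{N}$ is then a finite-memory strategy

$$C^V_\mathcal{N} = (M, act, \Delta, start),$$

where $M = S_{\mathcal{N}_{sur}} \cup \{init\}$ is the set of modes, $\Delta : M \times S_\mathcal{N} \to M$
is the transition function such that for every $m \in M$, $v \in S_\mathcal{N}$

$$\Delta(m, v) = \begin{cases} m & \text{if } v \notin S_{\mathcal{N}_{sur}}, \\ v & \text{otherwise.} \end{cases}$$

The function $act : M \times S_\mathcal{N} \to A_\mathcal{N}$ that selects an action to
be applied in $\mathcal{N}$ is for $m \in M$, $v \in S_\mathcal{N}$ defined as

$$act(m, v) = \begin{cases} (C^V_{\mathcal{N}_{sur}}(m))(v) & \text{if } m \in S_{\mathcal{N}_{sur}}, \\ \zeta_{init}(v) & \text{otherwise.} \end{cases}$$

Finally, $start : S_\mathcal{N} \to M$ selecting the starting mode for
$v \in S_\mathcal{N}$ is defined as

$$start(v) = \begin{cases} v & \text{if } v \in S_{\mathcal{N}_{sur}}, \\ m & \text{where } m \in S_{\mathcal{N}_{sur}} \text{ is such that } (C^V_{\mathcal{N}_{sur}}(m))(v) \text{ is defined}, \\ init & \text{otherwise.} \end{cases}$$

The strategy $C^V_\mathcal{N}$ attains the ACPC value $V^*_\mathcal{N}$ since it only simulates
the strategy $C^V_{\mathcal{N}_{sur}}$ by unwrapping the corresponding
partial strategies. ■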

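The following property of the strategy $C^V_\mathcal{N}$ is crucial for
the correctness of our approach to Problem 1.

Proposition 3: For every $(s, q) \in S_\mathcal{N}$, it holds that

$$\lim_{n \to \infty} \Pr^{\mathcal{N}, C^V_\mathcal{N}}_{(s,q)}\Big(\Big\{\rho \;\Big|\; \frac{g_\mathcal{P}(\rho^{(\le n)})}{n} \le V^*_\mathcal{N}\Big\}\Big) = 1,$$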

state, the ACPC problem for $\mathcal{N}_{sur}$ is equivalent to solving
the ACPS problem for $\mathcal{N}_{sur}$. Using one of the algorithms
in [4], we obtain a stationary strategy $C^V_{\mathcal{N}_{sur}}$ that attains the
ACPC value $V^*_{\mathcal{N}_{sur}}$ regardless of the starting state. From the
correspondence between $\mathcal{N}$ and $\mathcal{N}_{sur}$ it also follows that

$$V^*_{\mathcal{N}_{sur}} = V^*_\mathcal{N}.$$

Now we construct the strategy $C^V_\mathcal{N}$ for $\mathcal{N}$ and show that it
attains the minimum ACPC value $V^*_\mathcal{N}$ regardless of the initial
state. Intuitively, the strategy $C^V_\mathcal{N}$ is constructed to lead to
a single EC of $\mathcal{N}$ that provides the minimum ACPC value
and that is the EC encoded by the strategy $C^V_{\mathcal{N}_{sur}}$ for $\mathcal{N}_{sur}$.

Let $S_{def} \subseteq S_\mathcal{N}$ be the set of all states $v \in S_\mathcal{N}$ for which
there exists a surveillance state $v_{sur} \in S_{\mathcal{N}_{sur}}$ such that the
partial strategy $C^V_{\mathcal{N}_{sur}}(v_{sur})$ for $\mathcal{N}$ is defined on the state $v$.
We compute a partial strategy $\zeta_{init}$ that leads from every
state from $S_\mathcal{N} \setminus S_{def}$ to the set $S_{def}$ as follows. Let $\mathcal{N}'$ be
an MDP that is created from $\mathcal{N}$ by adding a new state $t$ and a
new action $\alpha_{def}$. From every state $v \in S_{def}$, we define a new
transition under $\alpha_{def}$ to $t$ with probability 1 and cost 0. Let
$C_{ssp}$ be a stationary optimal strategy for the SSP problem
for $\mathcal{N}'$ and $t$ as the terminal state. We define $\zeta_{init}(v) = C_{ssp}(v)$
for every $v \in S_\mathcal{N} \setminus S_{def}$.



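where $g_\mathcal{P}(\rho^{(\le n)})$ denotes the cumulative cost gained in the
first $n$ surveillance cycles of a run $\rho \in \mathrm{Run}^{\mathcal{N}, C^V_\mathcal{N}}((s, q))$. Hence,
for every $\epsilon > 0$, there exists $j(\epsilon) \in \mathbb{N}$ such that if the strategy
$C^V_\mathcal{N}$ is applied from a state $(s, q) \in S_\mathcal{N}$ for any $l \ge j(\epsilon)$
surveillance cycles, then the average expected cumulative
cost per surveillance cycle in these $l$ surveillance cycles is
at most $V^*_\mathcal{N} + \epsilon$ with probability at least $1 - \epsilon$, i.e.,

$$\Pr^{\mathcal{N}, C^V_\mathcal{N}}_{(s,q)}\Big(\Big\{\rho \;\Big|\; \frac{g_\mathcal{P}(\rho^{(\le l)})}{l} \le V^*_\mathcal{N} + \epsilon\Big\}\Big) \ge 1 - \epsilon.$$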



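Proof: In [7] the authors prove that a strategy solving
the ACPS problem for an MDP satisfies a property analogous
to the one in the proposition. In particular, for the strategy
$C^V_{\mathcal{N}_{sur}}$ for the reduced MDP $\mathcal{N}_{sur}$, it holds that for any state
$(s, q) \in S_{\mathcal{N}_{sur}}$

$$\lim_{n \to \infty} \Pr^{\mathcal{N}_{sur}, C^V_{\mathcal{N}_{sur}}}_{(s,q)}\Big(\Big\{\rho \;\Big|\; \frac{g_{sur}(\rho^{(n)})}{n} \le V^*_{\mathcal{N}_{sur}}\Big\}\Big) = 1,$$

where $g_{sur}(\rho^{(n)})$ denotes the cumulative cost gained in the
first $n$ stages of a run $\rho \in \mathrm{Run}^{\mathcal{N}_{sur}, C^V_{\mathcal{N}_{sur}}}((s, q))$. The proposition
then follows directly from the construction of the strategy
$C^V_\mathcal{N}$ from the strategy $C^V_{\mathcal{N}_{sur}}$. ■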
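The average cumulative cost per surveillance cycle that Prop. 3 refers to can be computed from a finite run prefix as in the following hedged Python sketch. The trace encoding (one pair per transition, holding the step cost and a flag telling whether the transition ends in a surveillance state) and the function names are assumptions made for illustration only.

```python
def cycle_costs(steps):
    """Split a finite trace into completed surveillance cycles.

    steps: iterable of (cost, reaches_surveillance) pairs, one per transition.
    Returns the cumulative cost of every completed cycle; cost accumulated
    after the last surveillance state is ignored.
    """
    costs, current = [], 0.0
    for cost, reaches_surveillance in steps:
        current += cost
        if reaches_surveillance:
            costs.append(current)
            current = 0.0
    return costs


def average_cost_per_cycle(steps):
    """Average cumulative cost per completed surveillance cycle, or None if none completed."""
    costs = cycle_costs(steps)
    return sum(costs) / len(costs) if costs else None


# Made-up trace with two completed cycles of cost 6 and 20.
trace = [(5, False), (1, True), (10, False), (10, True), (5, False)]
average = average_cost_per_cycle(trace)
```

For the made-up trace, the two completed cycles cost 6 and 20, so the average is 13; the trailing step of cost 5 belongs to an unfinished cycle and is not counted.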



C. Almost-sure acceptance in an MAEC 

Here we design a strategy for an MAEC $\mathcal{N} \in MAEC(\mathcal{P})$
that guarantees almost-sure satisfaction of the acceptance
condition $Acc$ of $\mathcal{A}_\phi$. Let $(B, G)$ be a pair in $Acc$ such that
$\mathcal{N}$ is accepting with respect to $(B, G)$, i.e., $L_\mathcal{P}(S_\mathcal{N}) \cap B = \emptyset$
and $L_\mathcal{P}(S_\mathcal{N}) \cap G \neq \emptyset$. There exists a stationary strategy $C^\phi_\mathcal{N}$
for $\mathcal{N}$ under which a state with label in $G$ is reached with
probability 1 regardless of the starting state, i.e.,

$$\Pr^{\mathcal{N}, C^\phi_\mathcal{N}}_{(s,q)}(F\, G) = 1 \qquad (4)$$

for every $(s, q) \in S_\mathcal{N}$. The existence of such a strategy
follows from the fact that $\mathcal{N}$ is an EC [3]. Moreover, we
construct $C^\phi_\mathcal{N}$ to minimize the expected cumulative cost
before reaching a state in $S_\mathcal{N} \cap S \times G$.

The strategy $C^\phi_\mathcal{N}$ is found as follows. Let $\mathcal{N}'$ be an MDP
that is created from $\mathcal{N}$ by adding a new state $t$ and a new
action $\alpha_G$. From every state $(s, q) \in S_\mathcal{N} \cap S \times G$, we define
a new transition under $\alpha_G$ to $t$ with probability 1 and cost
0. Let $C_{ssp}$ be a stationary optimal strategy for the SSP
problem for $\mathcal{N}'$ and $t$ as the terminal state. For a state $(s, q) \in S_\mathcal{N}$,
we define $C^\phi_\mathcal{N}((s, q)) = C_{ssp}((s, q))$ if the state $(s, q)$
does not have a label in $G$, otherwise $C^\phi_\mathcal{N}((s, q)) = \alpha$ for
some $\alpha \in A_\mathcal{N}((s, q))$.

Proposition 4: The strategy $C^\phi_\mathcal{N}$ for $\mathcal{N}$ resulting from the
above algorithm almost-surely reaches the set $S_\mathcal{N} \cap S \times G$
and minimizes the expected cumulative cost before reaching
the set, regardless of the initial state.

Proof: It follows directly from the fact that $C_{ssp}$
optimally solves the SSP problem for the MDP $\mathcal{N}'$ and $t$. ■
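The text only requires some stationary optimal strategy for the SSP problem and does not prescribe how to compute it. As one possible illustration, the sketch below uses plain value iteration, which is a technique chosen here and not necessarily the one used by the authors. The MDP encoding and the names `q_value` and `solve_ssp` are hypothetical, and the sketch assumes that the goal set is reachable with probability 1 from every state and that costs outside the goal are positive.

```python
def q_value(s, a, P, g, J):
    """One-step cost of action a in state s plus the expected cost-to-go."""
    return g[(s, a)] + sum(p * J[t] for t, p in P[(s, a)].items())


def solve_ssp(states, actions, P, g, goal, tol=1e-9, max_sweeps=10_000):
    """Value iteration for a stochastic shortest path problem.

    states  : iterable of states
    actions : dict state -> list of actions (only needed outside the goal)
    P       : dict (state, action) -> dict next_state -> probability
    g       : dict (state, action) -> cost
    goal    : set of terminal states, whose cost-to-go is 0
    Returns (J, policy): cost-to-go per state and a greedy stationary policy.
    """
    J = {s: 0.0 for s in states}
    for _ in range(max_sweeps):
        delta = 0.0
        for s in states:
            if s in goal:
                continue
            best = min(q_value(s, a, P, g, J) for a in actions[s])
            delta = max(delta, abs(best - J[s]))
            J[s] = best
        if delta < tol:
            break
    policy = {
        s: min(actions[s], key=lambda a: q_value(s, a, P, g, J))
        for s in states
        if s not in goal
    }
    return J, policy


# Made-up example: from state 0, "fast" reaches the goal 2 directly at cost 5,
# while "slow" moves to state 1 at cost 1; state 1 reaches the goal with
# probability 0.5 per step at cost 1.
states = [0, 1, 2]
actions = {0: ["slow", "fast"], 1: ["go"]}
P = {(0, "slow"): {1: 1.0}, (0, "fast"): {2: 1.0}, (1, "go"): {1: 0.5, 2: 0.5}}
g = {(0, "slow"): 1.0, (0, "fast"): 5.0, (1, "go"): 1.0}
J, policy = solve_ssp(states, actions, P, g, goal={2})
```

In the example, the expected cost from state 1 is 2, so "slow" (expected total 3) is preferred to "fast" (cost 5) in state 0. In the construction above, the goal set would play the role of the added terminal state $t$.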

D. Optimal strategy for $\mathcal{P}$

Finally, we are ready to construct the strategy $C_\mathcal{P}$ for the
product $\mathcal{P}$ that projects to an optimal solution for $\mathcal{M}$.

First, starting from the initial state $s_{\mathcal{P}init}$, $C_\mathcal{P}$ applies the
strategy $C_0$ resulting from the algorithm described in Sec. IV-A
until a state of an MAEC in the set $maec^*$ is reached. Let
$\mathcal{N} \in maec^*$ denote the MAEC and let $(B, G) \in Acc$ be
a pair from the accepting condition of $\mathcal{A}_\phi$ such that $\mathcal{N}$ is
accepting with respect to $(B, G)$.

Now, the strategy $C_\mathcal{P}$ starts to play the rounds. Each round
consists of two phases. First, play the strategy $C^\phi_\mathcal{N}$ from
Sec. IV-C until a state with label in $G$ is reached. Let us
denote $k_i$ the number of steps we play $C^\phi_\mathcal{N}$ in the $i$-th round.
The second phase applies the strategy $C^V_\mathcal{N}$ from Sec. IV-B
until the number of completed surveillance cycles in the
second phase of the current round is $l_i$. The number $l_i$ is any
natural number for which

$$l_i \ge \max\{\, j(\tfrac{1}{i}),\; i \cdot k_i \cdot g_{\mathcal{P}max} \,\},$$

where $j(\tfrac{1}{i})$ is from Prop. 3 and $g_{\mathcal{P}max}$ is the maximum
value of the costs $g_\mathcal{P}$. After applying the strategy $C^V_\mathcal{N}$ for $l_i$
surveillance cycles, we proceed to the next round $i + 1$.
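The bound on the length of the second phase can be made concrete with a small Python sketch. The function name `round_length` and the callable `j` are hypothetical: the text only states that a suitable number $j(\epsilon)$ exists (Prop. 3), so any concrete `j` passed in here is an assumption.

```python
import math


def round_length(i, k_i, g_max, j):
    """Smallest number of surveillance cycles allowed for round i.

    i     : round number (i >= 1)
    k_i   : number of steps played in the first phase of round i
    g_max : maximum value of the costs
    j     : callable mapping epsilon to the bound j(epsilon) from Prop. 3
            (hypothetical; its concrete form is not given in the text)
    """
    return max(j(1.0 / i), math.ceil(i * k_i * g_max))


# With k_i = 5 and maximum cost 10, the second term alone gives 50 * i.
lengths = [round_length(i, 5, 10, lambda eps: 1) for i in (1, 2, 100)]
```

With a placeholder `j` that always returns 1, the lengths for rounds 1, 2 and 100 are 50, 100 and 5000, which shows how the second term grows linearly with the round number.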

Theorem 1: The strategy $C_\mathcal{P}$ almost-surely satisfies the
accepting condition $Acc$ of $\mathcal{A}_\phi$ and at the same time,
$C_\mathcal{P}$ minimizes the ACPC value $V_{\mathcal{P}, C_\mathcal{P}}(s_{\mathcal{P}init})$ among all
strategies for $\mathcal{P}$ almost-surely satisfying $Acc$.



Proof: From Prop. 1 it follows that when applying the
strategy $C_0$ from the initial state $s_{\mathcal{P}init}$, the set $maec^*$ is
reached with probability 1.

Assume that $\mathcal{P}$ enters an MAEC $\mathcal{N} \in maec^*$ that is accepting
with respect to a pair $(B, G) \in Acc$. Let $i$ be the current
round of $C_\mathcal{P}$ and $\epsilon_i = \tfrac{1}{i}$. According to Prop. 4, a state
with a label in $G$ is almost-surely reached. In addition,
using Prop. 3, the average expected cumulative cost per
surveillance cycle in the $i$-th round is at most

$$\frac{k_i \cdot g_{\mathcal{P}max} + l_i (V^*_\mathcal{N} + \epsilon_i)}{l_i} \le V^*_\mathcal{N} + \frac{2}{i} \qquad (l_i \ge i \cdot k_i \cdot g_{\mathcal{P}max})$$

with probability at least $1 - \tfrac{1}{i}$. Therefore, in the limit, in the
MAEC $\mathcal{N}$, we both satisfy the LTL specification and reach
the optimal ACPC value with probability 1. Together with
the fact that $maec^*$ and $C_0$ satisfy the condition in Eq. Q,
we have that $C_\mathcal{P}$ is an optimal strategy for $\mathcal{P}$. ■

E. Complexity and discussion 

The size of a Rabin automaton for an LTL formula $\phi$ is in
the worst case doubly exponential in the size of the set $AP$.
However, studies such as [16] show that in practice, for many
LTL formulas, automata are much smaller and manageable.

Once the product $\mathcal{P}$ is built, we compute the set
$MAEC(\mathcal{P})$ by running $|Acc|$ times an algorithm for MEC
decomposition, which is polynomial in the size of $\mathcal{P}$. The
size of the set $MAEC(\mathcal{P})$ is in the worst case $|Acc| \cdot |S_\mathcal{P}|$.
For each MAEC $\mathcal{N}$, we compute its reduction $\mathcal{N}_{sur}$ using
Alg. 1 in time $O(|S_\mathcal{N}| \cdot |A_\mathcal{N}|^{O(|S_\mathcal{N}|)})$. The optimal ACPC
value $V^*_\mathcal{N}$ and an optimal finite-memory strategy $C^V_\mathcal{N}$ are then
found in time polynomial in the size of the reduced MDP.

The algorithms for finding the strategy $C_0$ and the optimal
set $maec^*$ are again polynomial in the size of $\mathcal{P}$. Similarly,
computing a stationary strategy $C^\phi_\mathcal{N}$ for an MAEC $\mathcal{N} \in maec^*$
is polynomial in the size of $\mathcal{N}$.



As was proved in Sec. IV-D, the presented solution to
Problem 1 is correct and complete. However, the resulting
optimal strategy $C_\mathcal{P}$ for $\mathcal{P}$, and hence the projected strategy
$C$ for $\mathcal{M}$ as well, is not a finite-memory strategy in general.
The reason is that in the second phase of every round $i$, the
strategy $C^V_\mathcal{N}$ is applied for $l_i$ surveillance cycles and $l_i$ is
generally growing with $i$.

This, however, does not prevent the solution from being
used effectively. The following simple rule can be applied
to avoid performing all $l_i \ge \max\{i \cdot k_i \cdot g_{\mathcal{P}max},\, j(\tfrac{1}{i})\}$
surveillance cycles in every round $i$. When the computation
is in the second phase of round $i$ and the product is in an
MAEC $\mathcal{N} \in maec^*$, after completion of every surveillance
cycle, we can check whether the average cumulative cost
per surveillance cycle in round $i$ is at most $V^*_\mathcal{N} + \tfrac{2}{i}$. If yes,
we can proceed to the next round $i + 1$, otherwise continue




Fig. 2: (a) Initialized MDP $\mathcal{M}$ with initial state 0. The costs of applying $\alpha, \beta, \gamma$ in any state are 5, 10, 1, respectively, e.g., $g(1, \alpha) = 5$.
(b) Definitions of strategies $C_{init}$, $C_{p1}$, $C_{p2}$ for $\mathcal{M}$, the projections of strategies $C_0$, $C^\phi_\mathcal{N}$, $C^V_\mathcal{N}$ for $\mathcal{P}$, respectively. The condition "before
job" means that the corresponding prescription is used if the job location has not yet been visited since the last visit of the base. Similarly,
the prescription with condition "after job" is used if the job location was visited at least once since the last visit of the base.



with the second phase of round $i$. As the simulation results
in Sec. V show, the use of this simple rule dramatically
decreases the number of performed surveillance cycles in
almost every round.
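The stopping rule can be sketched in Python as follows. The function name, the list-based encoding of per-cycle costs and the optional `first_phase_cost` argument are assumptions made for illustration; in particular, whether the cost of the first phase is counted in the round average is left to the caller, and the threshold is passed in explicitly.

```python
def second_phase_cycles(cycle_costs, l_i, threshold, first_phase_cost=0.0):
    """Number of surveillance cycles performed before moving to the next round.

    cycle_costs      : costs of the surveillance cycles in the order they occur
    l_i              : upper limit on the number of cycles in this round
    threshold        : bound on the average cost per cycle that ends the round early
    first_phase_cost : cost already accumulated in the first phase (assumption)
    Returns None if the recorded cycles run out before either condition holds.
    """
    total = first_phase_cost
    for n, cost in enumerate(cycle_costs, start=1):
        total += cost
        if n >= l_i or total / n <= threshold:
            return n
    return None


# Made-up cycle costs: the running average drops to 40 after three cycles.
early = second_phase_cycles([50, 40, 30, 30], l_i=100, threshold=41)
capped = second_phase_cycles([50, 40, 30, 30], l_i=2, threshold=41)
```

In the made-up data, the running averages are 50, 45 and 40, so the round ends after three cycles when the limit is large, and after two cycles when the limit `l_i = 2` is hit first.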

On the other hand, the complexity of the resulting strategy
$C$ for $\mathcal{M}$ can be reduced from non-finite-memory to finite-memory
in the following case. Assume that for every $\mathcal{N} \in maec^*$,
the optimal ACPC strategy $C^V_\mathcal{N}$ leads to an EC that
contains a state from $G$, where $\mathcal{N}$ is accepting with respect to
the pair $(B, G) \in Acc$. In this case, the optimal strategy $C_\mathcal{P}$
can be defined as a finite-memory strategy that first applies
the strategy $C_0$ to reach a state of an MAEC $\mathcal{N} \in maec^*$,
and from that point on, only applies the strategy $C^V_\mathcal{N}$.

V. CASE STUDY

We implemented the solution presented in Sec. IV in
Java and applied it to a persistent surveillance robotics
example [20]. In this section, we report on the simulation
results.

Consider a mobile robot moving in a partitioned environment.
The motion of the robot is modeled by the
initialized MDP $\mathcal{M}$ shown in Fig. 2(a). The set $AP$ of atomic
propositions contains two propositions base and job. As
depicted in Fig. 2(a), state is the base location and state 8
is the job location. At the job location, the robot performs
some work, and at the base, it reports on its job activity.

The robot's mission is to visit both base and job location
infinitely many times. In addition, at least one job must be
performed after every visit of the base, before the base is
visited again. The corresponding LTL formula is

$$\phi = GF\,\mathit{base} \wedge GF\,\mathit{job} \wedge G(\mathit{base} \Rightarrow X(\neg \mathit{base} \; U \; \mathit{job})).$$

While satisfying the formula, we want to minimize the
expected average cost between two consecutive jobs, i.e.,
the surveillance proposition $\pi_{sur} = \mathit{job}$.

In the simulation, we use a Rabin automaton $\mathcal{A}_\phi$ for the
formula that has 5 states and the accepting condition contains
1 pair. The product $\mathcal{P}$ of the MDP $\mathcal{M}$ and $\mathcal{A}_\phi$ has 50 states
and one MAEC $\mathcal{N}$ of 19 states. The optimal set of MAECs is
$maec^* = \{\mathcal{N}\}$. The optimal ACPC value is $V^*_\mathcal{N} = 40.5$. In
Fig. 2(b), we list the projections of strategies $C_0$, $C^\phi_\mathcal{N}$, $C^V_\mathcal{N}$
for $\mathcal{P}$ to strategies $C_{init}$, $C_{p1}$, $C_{p2}$ for $\mathcal{M}$, respectively. The
optimal strategy $C$ for $\mathcal{M}$ is then defined as follows. Starting
from the initial state 0, apply strategy $C_{init}$ until a state
is reached where $C_{init}$ is no longer defined. Start round
number 1. In the $i$-th round, proceed as follows. In the first phase
of the round, apply strategy $C_{p1}$ until the base is reached and
then for one more step (the product $\mathcal{P}$ has to reach a state
from the Rabin pair). Let $k_i$ denote the number of steps in
the first phase of round $i$. In the second phase, use strategy
$C_{p2}$ for $l_i = \max\{i \cdot k_i \cdot 10,\, j(\tfrac{1}{i})\}$ surveillance cycles, i.e.,
until the number of jobs performed by the robot is $l_i$. We
also use the rule described in Sec. IV-E to shorten the second
phase, if possible.

Let us summarize the statistical results we obtained for 5
executions of the strategy $C$ for $\mathcal{M}$, each of 100 rounds. The
number $k_i$ of steps in the first phase of a round $i > 1$ was
always 5 because in such a case, the first phase starts at the job
location and the strategy $C_{p1}$ needs to be applied for exactly
4 steps to reach the base, followed by the one additional step. Therefore, in every round $i > 1$,
the number $l_i$ is at least $50 \cdot i$, e.g., in round 100, $l_i \ge 5000$.
However, using the rule described in Sec. IV-E, the average
number of jobs per round was 130 and the median was only
14. In particular, the number was not increasing with the
round. On the contrary, it appears to be independent of
the history of the execution. In addition, at most 2 rounds
in each of the executions finished only at the point when
the number of jobs performed by the robot in the second
phase reached $l_i$. The average ACPC value attained after
100 rounds was 40.56.

In contrast to our solution, the algorithm proposed in [11]
does not find an optimal strategy for $\mathcal{M}$. Regardless of the
initialization of the algorithm, it always results in a sub-optimal
strategy, namely the strategy $C_{p1}$ from Fig. 2(b) that
has ACPC value 50.5.
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